Abstract

Factorial clustering methods have been developed in recent years thanks to improvements in computational power. These methods perform a linear transformation of the data and a clustering of the transformed data, optimizing a common criterion. Factorial PD-clustering is based on Probabilistic Distance clustering (PD-clustering). PD-clustering is an iterative, distribution-free, probabilistic clustering method. Factorial PD-clustering performs a linear transformation of the original variables into a reduced number of orthogonal ones, using a criterion shared with PD-clustering. This paper demonstrates that the Tucker3 decomposition yields this transformation. Factorial PD-clustering alternates the Tucker3 decomposition and PD-clustering on the transformed data until convergence is achieved. This method can significantly improve the algorithm performance; large datasets can thus be partitioned into clusters with increased stability and robustness of the results.

1 Introduction 

In a broad definition, cluster analysis is a multivariate analysis technique that seeks to organize information about variables in order to discover homogeneous groups, or "clusters", in data. The presence of groups in data depends on the association structure of the data. Clustering algorithms aim at finding groups that are homogeneous with respect to the association structure among variables. Proximity measures or distances can be used to separate homogeneous groups. A measure of the homogeneity of a group is the variance. When dealing with numerical, linearly independent variables, a clustering problem consists in minimising the sum of the squared Euclidean distances within classes: the within-groups deviance.

"The term cluster analysis refers to an entire process where clustering may be only a step" [Gordon, 1999]. According to Gordon's definition, cluster analysis can be sketched in three main stages:

• transformation of data into a similarity/dissimilarity matrix; 

• clustering; 

• validation. 

The transformation of data into similarity/dissimilarity measures depends on the data type. A clustering method can then be applied to the transformed matrix. Clustering methods can be divided into three main types: hierarchical, non-hierarchical and fuzzy [Wedel and Kamakura, 1999]. Non-hierarchical clustering methods are considered in this paper. Among them, the best known and most widely used method is k-means. It is an iterative method that starts with a random initial partition of the units and keeps reassigning the units to clusters, based on the squared distances between the units and the cluster centers, until convergence is reached. Interested readers can refer to [Gordon, 1999]. The major issues of k-means are that the clusters can be sensitive to the choice of the initial centers and that the algorithm may converge to local minima.
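The k-means iteration described above can be illustrated with a minimal sketch. This is a generic Lloyd-style implementation written for illustration, not code from the paper; the function name `kmeans` and the choice of passing the initial centers explicitly (instead of a random initial partition) are assumptions made so that the dependence on the starting point is visible.

```python
def kmeans(points, centers, max_iter=100):
    """Lloyd-style k-means on tuples of coordinates (illustrative sketch).

    `centers` holds the initial centers; different choices may lead to
    different local minima of the within-groups deviance.
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each unit goes to the center with the smallest
        # squared Euclidean distance.
        new_labels = [
            min(range(len(centers)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centers[k])))
            for p in points
        ]
        if new_labels == labels:
            break  # no unit changed cluster: convergence
        labels = new_labels
        # Update step: each center becomes the arithmetic mean of its units.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers
```

Running the function from several different initial centers and comparing the resulting partitions is a simple way to observe the sensitivity mentioned in the text.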

The choice of the number of clusters is a well-known problem of non-hierarchical methods. It is not dealt with in this paper, where the number of clusters is assumed to be known a priori.

The performance of non-hierarchical clustering methods can be strongly affected by dimensionality. Let us consider an $n \times J$ data matrix $X$, with $n$ the number of units and $J$ the number of variables. Non-hierarchical methods easily deal with large $n$; however, they can fail when $J$ becomes large or very large and when the variables are correlated: they do not converge, or they converge to a different solution at each run. To cope with these issues, the French school of data analysis [Lebart et al., 1984] suggested a strategy to improve the overall quality of clustering that consists of two phases: transformation of the variables through a factorial method, and clustering on the transformed variables. Arabie and Hubert [Arabie and Hubert, 1994] further formalized the method and called it tandem analysis.

The choice of the factorial method is an important and delicate phase because it affects the results. The main factorial methods are [Le Roux and Rouanet, 2004]:

• quantitative data:

- Principal Component Analysis (PCA);

• binary data:

- Principal Component Analysis (PCA);

- Correspondence Analysis (CA);

- Multiple Correspondence Analysis (MCA);

• nominal data:

- Multiple Correspondence Analysis (MCA).

The second phase of tandem analysis consists in applying a clustering method. Tandem analysis exploits the capability of factor analysis to obtain a reduced number of uncorrelated variables that are linear transformations of the original ones. This gives more stability to the results and makes the procedure faster. However, tandem analysis minimises two different functions that can be in conflict, and the first factorial step can partly obscure or mask the clustering structure.

This technique has the advantage of working with a reduced number of variables that are orthogonal and ordered with respect to the information they carry. Moreover, dimensionality reduction makes it possible to visualize the cluster structure in a two- or three-dimensional factorial space [Palumbo et al., 2008].



To cope with these issues Vichi and Kiers [Vichi and Kiers, 2001] proposed Factorial k-means analysis for two-way data. The aim of this method is to identify the best partition of the objects and to find a subset of factors that best describes the classification according to the least squares criterion. A two-step Alternating Least Squares (ALS) algorithm solves this problem. The advantage of Factorial k-means is that the two steps optimize a single objective function. However, the k-means algorithm itself, and as a consequence tandem analysis and Factorial k-means, is based on the arithmetic mean, which gives rise to unsatisfactory solutions when the clusters are not spherical.

Probabilistic clustering methods may give better results under this condition because they assign a statistical unit to a cluster according to a probability function that can be defined independently of the arithmetic mean.

Probabilistic Distance clustering (PD-clustering) [Ben-Israel and Iyigun, 2008] is an iterative, dis- 
tribution free, probabilistic, clustering method. PD-clustering assigns units to a cluster according 
to their probability of belonging to the cluster, under the constraint that the product between the 
probability and the distance of each point to any cluster center is a constant. 
When the number of variables is large and variables are correlated, PD-clustering becomes unsta- 
ble and the correlation between variables can hide the real number of clusters. A linear transfor- 
mation of original variables into a reduced number of orthogonal ones using common criteria with 
PD-clustering can significantly improve the algorithm performance. The objective of this paper is 
to introduce an improved version of PD-clustering called Factorial PD-clustering (FPDC). 

The paper has the following structure: section 2: detailed presentation of PD-clustering method; 
section 3: presentation of our suggestion for a Factorial PD-clustering method; section 4: applica- 
tion of Factorial PD-clustering on a simulated case study and comparison with k-means. 



2 Probabilistic Distance Clustering 

PD-clustering is a non-hierarchical algorithm that assigns units to clusters according to their probability of belonging to each cluster. We introduce PD-clustering following the notation of Ben-Israel and Iyigun [Ben-Israel and Iyigun, 2008]. Given some random centers, the probability of any point belonging to each class is assumed to be inversely proportional to its distance from the centers of the clusters. Given a data matrix $X$ with $n$ units and $J$ variables, and given $K$ clusters that are assumed to be non-empty, PD-clustering is based on two quantities: the distance of each data point $x_i$ from the $K$ cluster centers $c_k$, $d(x_i, c_k)$, and the probability of each point belonging to a cluster, $p(x_i, c_k)$, with $k = 1,\dots,K$ and $i = 1,\dots,n$. The relation between them is the basic assumption of the method. Let us consider the general term $x_{ij}$ of $X$ and a center matrix $C$ of elements $c_{kj}$, with $k = 1,\dots,K$, $i = 1,\dots,n$ and $j = 1,\dots,J$. Their distance can be computed according to different criteria, the squared norm being one of the most commonly used. The generic distance $d(x_i, c_k)$ represents the distance of the generic point $i$ from the generic center $k$. The probability $p(x_i, c_k)$ of each point belonging to a cluster can be computed according to the following assumption: the product of the distances and the probabilities is a constant $F(x_i)$ depending on $x_i$.

For short we use $p_{ik} = p(x_i, c_k)$ and $d_k(x_i) = d(x_i, c_k)$; the basic assumption of PD-clustering is expressed as:

$$p_{ik}\, d_k(x_i) = F(x_i), \qquad (1)$$

for a given value of $x_i$ and for all $k = 1,\dots,K$.

As the distance of a point from a cluster center increases, the probability of the point belonging to that cluster decreases. The constant depends only on the point and does not depend on the cluster $k$.

Starting from (1) it is possible to compute $p_{ik}$:

$$p_{im}\, d_m(x_i) = p_{ik}\, d_k(x_i); \qquad p_{im} = \frac{p_{ik}\, d_k(x_i)}{d_m(x_i)}, \quad \forall m = 1,\dots,K. \qquad (2)$$

The term $p_{ik}$ is a probability so, under the constraint $\sum_{m=1}^{K} p_{im} = 1$, the sum over $m$ of (2) becomes:

$$p_{ik} \sum_{m=1}^{K} \frac{d_k(x_i)}{d_m(x_i)} = 1, \qquad p_{ik} = \frac{\prod_{m \neq k} d_m(x_i)}{\sum_{m=1}^{K} \prod_{l \neq m} d_l(x_i)}. \qquad (3)$$

Starting from (1) and using (3) it is possible to define the value of the constant $F(x_i)$:

$$F(x_i) = p_{ik}\, d_k(x_i), \quad k = 1,\dots,K,$$

$$F(x_i) = \frac{\prod_{m=1}^{K} d_m(x_i)}{\sum_{m=1}^{K} \prod_{k \neq m} d_k(x_i)}. \qquad (4)$$

The quantity $F(x_i)$, also called the Joint Distance Function (JDF), is a measure of the closeness of $x_i$ to all the cluster centers. The JDF measures the classifiability of the point $x_i$ with respect to the centers $c_k$, with $k = 1,\dots,K$. If it is equal to zero, the point coincides with one of the cluster centers; in this case the point belongs to that class with probability 1. If all the distances between the point $x_i$ and the centers of the classes are equal to $d_i$, then $F(x_i) = d_i/K$ and all the probabilities of belonging to each class are equal: $p_{ik} = 1/K$. The smaller the JDF value, the higher the probability of the point belonging to one cluster.
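Formulas (3) and (4) can be checked numerically for a single point. The following is a minimal sketch; the function names `pd_probabilities` and `joint_distance` are hypothetical, and the handling of a point that coincides with more than one center (not covered by the formulas) is an assumption of the sketch.

```python
from math import prod

def pd_probabilities(distances):
    """Membership probabilities of one point from its K distances, as in (3)."""
    K = len(distances)
    # Numerator for cluster k: product of the distances from all other centers.
    numerators = [prod(d for m, d in enumerate(distances) if m != k)
                  for k in range(K)]
    total = sum(numerators)
    if total == 0:
        # Assumption of this sketch: the degenerate case is rejected.
        raise ValueError("point coincides with more than one center")
    return [num / total for num in numerators]

def joint_distance(distances):
    """Joint distance function F(x_i) of one point, as in (4)."""
    K = len(distances)
    denominator = sum(prod(d for m, d in enumerate(distances) if m != k)
                      for k in range(K))
    if denominator == 0:
        raise ValueError("point coincides with more than one center")
    return prod(distances) / denominator
```

With four equal distances the probabilities are all 1/4 and the JDF equals the common distance divided by 4; with one zero distance the corresponding probability is 1 and the JDF is 0, in agreement with the properties stated above.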

The whole clustering problem consists in identifying the centers that minimise the JDF. Without loss of generality, the PD-clustering optimality criterion can be demonstrated for $K = 2$:

$$\min \left( d_1(x_i)\, p_{i1}^2 + d_2(x_i)\, p_{i2}^2 \right) \qquad (5)$$
$$\text{s.t.} \quad p_{i1} + p_{i2} = 1, \qquad p_{i1}, p_{i2} \geq 0.$$

The probabilities are squared because this gives a smoothed version of the original function. The Lagrangian of this problem is:

$$\mathcal{L}(p_{i1}, p_{i2}, \lambda) = d_1(x_i)\, p_{i1}^2 + d_2(x_i)\, p_{i2}^2 - \lambda (p_{i1} + p_{i2} - 1). \qquad (6)$$

Setting to zero the partial derivatives with respect to $p_{i1}$ and $p_{i2}$, substituting the probabilities (3) and considering the principle $p_{i1}\, d_1(x_i) = p_{i2}\, d_2(x_i)$, we obtain the optimal value of the Lagrangian:

$$\mathcal{L}(p_{i1}, p_{i2}, \lambda) = \frac{d_1(x_i)\, d_2(x_i)}{d_1(x_i) + d_2(x_i)}. \qquad (7)$$
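As a worked check of the case $K = 2$, the intermediate steps can be written out explicitly. The derivation below only uses the problem (5), the Lagrangian (6) and the constraint; the shorthand $d_1$, $d_2$ for $d_1(x_i)$, $d_2(x_i)$ and the numerical values are chosen purely for illustration.

```latex
% Stationarity of the Lagrangian (6):
\frac{\partial \mathcal{L}}{\partial p_{i1}} = 2 d_1 p_{i1} - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial p_{i2}} = 2 d_2 p_{i2} - \lambda = 0
\;\Longrightarrow\; p_{i1} d_1 = p_{i2} d_2 .

% Together with p_{i1} + p_{i2} = 1:
p_{i1} = \frac{d_2}{d_1 + d_2}, \qquad p_{i2} = \frac{d_1}{d_1 + d_2}.

% Value of the objective at these probabilities:
d_1 p_{i1}^2 + d_2 p_{i2}^2
  = \frac{d_1 d_2^2 + d_2 d_1^2}{(d_1 + d_2)^2}
  = \frac{d_1 d_2}{d_1 + d_2}.

% Illustrative numbers: d_1 = 1,\ d_2 = 3 \text{ give }
% p_{i1} = 0.75,\ p_{i2} = 0.25 \text{ and the value } 3/4 .
```

The resulting value is the expression (4) evaluated for $K = 2$.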

This value coincides with the JDF, so the matrix of centers that minimises this quantity minimises the JDF too. Substituting the generic value $d_k(x_i)$ with $\|x_i - c_k\|$, we can find the equations of the centers that minimise the JDF (and maximise the probability of each point belonging to only one cluster):

$$c_k = \sum_{i=1}^{n} \left( \frac{u_k(x_i)}{\sum_{j=1}^{n} u_k(x_j)} \right) x_i, \qquad (8)$$

where

$$u_k(x_i) = \frac{p_{ik}^2}{d_k(x_i)}. \qquad (9)$$

As shown before, the value of the JDF at each center is equal to zero and it is necessarily positive elsewhere. So the centers are the global minimisers of the JDF. Other stationary points may exist, because the function is neither convex nor quasi-convex, but they are saddle points.
There are alternative ways of modelling the relation between probabilities and distances; for example, the probabilities can decay exponentially as the distances increase. In this case the probabilities $p_{ik}$ and the distances $d_k(x_i)$ are related by:

$$p_{ik}\, e^{d_k(x_i)} = E(x_i), \qquad (10)$$

where $E(x_i)$ is a constant depending on $x_i$.

Many results of the previous case can be extended to this case by replacing the distance $d_k(x_i)$ with $e^{d_k(x_i)}$. Interested readers are referred to Ben-Israel and Iyigun [Ben-Israel and Iyigun, 2008].
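Under relation (10) the probabilities of one point are proportional to $e^{-d_k(x_i)}$, and normalising them so that they sum to one gives their values. A minimal sketch follows; the function name `pd_probabilities_exp` is hypothetical, and subtracting the smallest distance before exponentiating is a numerical precaution added here, not part of the original formulation.

```python
from math import exp

def pd_probabilities_exp(distances):
    """Probabilities of one point under p_ik * exp(d_k(x_i)) = E(x_i)."""
    # Shifting by the minimum distance leaves the ratios unchanged and
    # avoids underflow for large distances (precaution of this sketch).
    shift = min(distances)
    weights = [exp(-(d - shift)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]
```

For two distances that differ by one unit, the ratio of the two probabilities is $e$, so the decay is much faster than under the basic assumption (1).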



The optimisation problem presented in (5) is the original version proposed by Ben-Israel and Iyigun. Notice that in the optimisation problem the probabilities $p_{ik}$ are considered in squared form. The authors state that it is possible to consider the squared distance $d_k^2$ as well as the simple distance $d_k$. Both choices have advantages and drawbacks. Squared distances offer analytical advantages due to linear derivatives. Using simple distances ensures more robust results, and the optimisation problem can be reduced to a Fermat-Weber location problem. The Fermat-Weber location problem aims at finding a point that minimises the sum of the Euclidean distances from a set of given points. This problem can be solved with the Weiszfeld method [Weiszfeld, 1937]. Convergence of this method was established by modifying the gradient so that it is always defined [Kuhn, 1973]. The modification is not carried out in practice. The global solution is guaranteed only in the case of one cluster. With more than one cluster, in practice, the method converges only for a limited number of centers, depending on the data.
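The plain Weiszfeld iteration for the single-center Fermat-Weber problem can be sketched as follows. This is a generic textbook version written for illustration; the function name `weiszfeld`, the starting point (the centroid), the tolerance and the behaviour when an iterate falls exactly on a data point are assumptions of the sketch. In particular, it does not implement the gradient modification mentioned above.

```python
from math import dist

def weiszfeld(points, max_iter=1000, tol=1e-10):
    """Approximate the point minimising the sum of Euclidean distances."""
    # Start from the centroid of the points (assumption of this sketch).
    y = tuple(sum(col) / len(points) for col in zip(*points))
    for _ in range(max_iter):
        numerator = [0.0] * len(y)
        denominator = 0.0
        for p in points:
            d = dist(p, y)
            if d == 0.0:
                # The unmodified update is undefined when the iterate
                # coincides with a data point; this sketch simply stops.
                return y
            w = 1.0 / d  # weight inversely proportional to the distance
            denominator += w
            for j, pj in enumerate(p):
                numerator[j] += w * pj
        new_y = tuple(v / denominator for v in numerator)
        if dist(new_y, y) < tol:
            return new_y
        y = new_y
    return y
```

For the four corners of a square the iteration returns the center of the square, which is the point minimising the sum of the distances.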

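In this paper we consider the squared form:

$$d_k(x_i) = \sum_{j=1}^{J} (x_{ij} - c_{kj})^2, \qquad (11)$$

where $k = 1,\dots,K$ and $i = 1,\dots,n$. Starting from (11), the distance matrix $D$ of order $n \times K$ is defined, whose general element is $d_k(x_i)$. The final solution is obtained by minimising the quantity:

$$JDF = \sum_{i=1}^{n} \sum_{k=1}^{K} d_k(x_i)\, p_{ik}^2 = \sum_{i=1}^{n} \sum_{j=1}^{J} \sum_{k=1}^{K} (x_{ij} - c_{kj})^2 p_{ik}^2, \qquad (12)$$

$$JDF = \arg\min_{C} \sum_{i=1}^{n} \sum_{j=1}^{J} \sum_{k=1}^{K} (x_{ij} - c_{kj})^2 p_{ik}^2, \qquad (13)$$

where $c_{kj}$ is the generic element of the center matrix and $d_k(x_i)$ is defined in (11).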
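The way the quantities defined so far fit together can be illustrated with a minimal sketch that alternates the distances (11), the probabilities (3), the criterion (12) and the center update (8)-(9). It is an illustrative reading of these formulas, not the authors' implementation: the function name `pd_clustering`, the explicit initial centers, the small floor on the distances and the stopping rule (stop when the JDF no longer decreases by more than a tolerance) are all assumptions.

```python
from math import prod

def pd_clustering(X, centers, max_iter=100, tol=1e-9):
    """Illustrative PD-clustering loop on a list of n points (tuples of J floats)."""
    centers = [tuple(c) for c in centers]
    n, J, K = len(X), len(X[0]), len(centers)
    prev_jdf = float("inf")
    for _ in range(max_iter):
        # Squared distances as in (11); the floor avoids divisions by zero
        # when a point coincides with a center (assumption of this sketch).
        D = [[max(sum((x - c) ** 2 for x, c in zip(xi, ck)), 1e-12)
              for ck in centers] for xi in X]
        # Probabilities as in (3).
        P = []
        for d in D:
            nums = [prod(d[m] for m in range(K) if m != k) for k in range(K)]
            total = sum(nums)
            P.append([v / total for v in nums])
        # Criterion (12).
        jdf = sum(D[i][k] * P[i][k] ** 2 for i in range(n) for k in range(K))
        if prev_jdf - jdf <= tol:
            break  # the JDF no longer decreases
        prev_jdf = jdf
        # Center update as in (8)-(9).
        for k in range(K):
            u = [P[i][k] ** 2 / D[i][k] for i in range(n)]
            total = sum(u)
            centers[k] = tuple(sum(ui * xi[j] for ui, xi in zip(u, X)) / total
                               for j in range(J))
    # Each unit is assigned to the cluster with the highest probability.
    labels = [max(range(K), key=lambda k: p[k]) for p in P]
    return centers, P, jdf, labels
```

On two well-separated groups of points the returned labels separate the groups, and each row of the probability matrix sums to one.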

The solution of the PD-clustering problem can be obtained through an iterative algorithm (Algorithm 1). The convergence of the algorithm is demonstrated in [Iyigun, 2007]. Each unit is then assigned to the cluster $k$ for which its probability, computed a posteriori using formula (3), is highest.

Algorithm 1 Probabilistic Distance Clustering Function

function PDC(X, K)
    C ← rand(K, J)                  ▷ Matrix C_kj is randomly initialised
    JDF ← 1/eps                     ▷ JDF is initialised to the maximum
    D ← 0                           ▷ Initialise the array D, of dimension n × K, to 0
    p ← 1/K                         ▷ Initialise to 1/K the probability vector p of n elements
    repeat
        for k = 1, K do
            D_k ← distance(X, C(k)) ▷ Distances of all units from the centre k according to formula (11)
        end for
        JDF0 ← JDF                  ▷ Current JDF is stored in JDF0
        C ← C*                      ▷ Centres are updated according to formula (8)
        JDF ← jdf(D)                ▷ jdf(D) implements the formula (4)
    until JDF0 ≤ JDF                ▷ Stop when the JDF no longer decreases
    P ← compp(D)                    ▷ Function compp implements the formula (3)
    return C, P, JDF
end function

3 Factorial PD-Clustering

When the number of variables is large and variables are correlated, PD-Clustering becomes very unstable and the correlation between variables can hide the real number of clusters. A linear
transformation of original variables into a reduced number of orthogonal ones can significantly 
improve the algorithm performance. Combination of PD-Clustering and variables linear transfor- 
mation implies a common criterion. 

This section shows how the Tucker3 method [Kroonenberg, 2008] can be properly adopted for the transformation in Factorial PD-Clustering; an algorithm is then proposed to perform the method.



3.1 Theoretical approach to Factorial PD-clustering 

Firstly we demonstrate that the minimisation problem in (12) corresponds to the Tucker3 decomposition of the distance matrix $G$ of general element $g_{ijk} = |x_{ij} - c_{kj}|$. It is a three-way matrix of order $n \times J \times K$, where $n$ is the number of units, $J$ the number of variables and $K$ the number of occasions. For any $c_k$ with $k = 1,\dots,K$, an $n \times J$ distance matrix $G_k$ is defined. In matrix notation:

$$G_k = X - h c_k, \qquad (14)$$

where $h$ is an $n \times 1$ column vector with all terms equal to 1; $X$ and $c_k$ ($k = 1,\dots,K$) have already been defined in section 2.
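The construction of the three-way array can be made concrete with a short sketch that builds the $K$ slabs $G_k = X - h c_k$ of (14) as nested lists. The function name `distance_array` and the storage layout `G[k][i][j]` are assumptions of the sketch; the slabs hold the signed differences of (14), and taking their absolute values elementwise gives the elements $g_{ijk}$ as defined in the text (the squared values used in the criterion are the same in both cases).

```python
def distance_array(X, C):
    """Return G with G[k][i][j] = x_ij - c_kj, i.e. the slabs G_k of (14)."""
    return [
        [[x_ij - c_kj for x_ij, c_kj in zip(x_i, c_k)] for x_i in X]
        for c_k in C
    ]
```

For $n$ units, $J$ variables and $K$ centers the result has $K$ slabs of $n$ rows and $J$ columns each.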

The Tucker3 method decomposes the matrix $G$ into three components, one for each mode, a full core array $\Lambda$ and an error term $E$:

$$g_{ijk} = \sum_{r=1}^{R} \sum_{q=1}^{Q} \sum_{s=1}^{S} \lambda_{rqs} (u_{ir}\, b_{jq}\, v_{ks}) + e_{ijk}, \qquad (15)$$

where $\lambda_{rqs}$ and $e_{ijk}$ are respectively the general terms of the three-way matrix $\Lambda$ of order $R \times Q \times S$ and of $E$ of order $n \times J \times K$; $u_{ir}$, $b_{jq}$ and $v_{ks}$ are respectively the general terms of the matrix $U$ of order $n \times R$, $B$ of order $J \times Q$ and $V$ of order $K \times S$, with $i = 1,\dots,n$, $j = 1,\dots,J$, $k = 1,\dots,K$.

As in all factorial methods, the factorial axes in the Tucker3 model are sorted according to explained variability. The first factorial axes explain the greatest part of the variability; the last factors are influenced by anomalous data or represent background noise. For this reason, choosing a number of factors lower than the number of variables makes the method robust. According to [Kiers and Kinderen, 2003], the choice of the parameters $R$, $Q$ and $S$ is a delicate problem because they define the overall explained variability. Interested readers are referred to [Kroonenberg, 2008] for the theoretical aspects concerning this choice. We use a heuristic approach to cope with this crucial issue: we choose the minimum number of factors that corresponds to a significant value of the explained variability.

The coordinates $x^*_{iq}$ of the generic unit in the space of the variables obtained through the Tucker3 decomposition are given by the following expression:

$$x^*_{iq} = \sum_{j=1}^{J} x_{ij}\, b_{jq}. \qquad (16)$$
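Expression (16) is an ordinary matrix product $X^* = XB$. A minimal sketch with nested lists, where the function name `project` is hypothetical and $B$ is assumed to be stored row by row as a $J \times Q$ list:

```python
def project(X, B):
    """Coordinates x*_iq = sum_j x_ij * b_jq for an n x J matrix X and a J x Q matrix B."""
    Q = len(B[0])
    return [
        [sum(x_ij * B[j][q] for j, x_ij in enumerate(row)) for q in range(Q)]
        for row in X
    ]
```

With $Q < J$ the projected units have fewer coordinates than the original ones, which is the dimensionality reduction exploited by the method.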



Finally, PD-clustering is applied to these coordinates $x^*_{iq}$ in order to solve the clustering problem. Starting from expression (12), it is worth noting that minimising the quantity

$$JDF = \sum_{i=1}^{n} \sum_{j=1}^{J} \sum_{k=1}^{K} (x_{ij} - c_{kj})^2 p_{ik}^2 \quad \text{s.t.} \quad \sum_{i=1}^{n} \sum_{k=1}^{K} p_{ik}^2 \leq n, \qquad (17)$$

is equivalent to computing the maximum of $-\sum_{i=1}^{n} \sum_{j=1}^{J} \sum_{k=1}^{K} (x_{ij} - c_{kj})^2 p_{ik}^2$ under the same constraints.

Taking into account Proposition 1 (proof in A.1) and the following lemma, we demonstrate that the Tucker3 decomposition is a consistent linear variable transformation that determines the best subspace according to the PD-clustering criterion.

Proposition 1 Given an unknown matrix $B$ of generic element $b_{im}$ and a set of coefficients $0 \leq \psi_{im} \leq 1$, with $m = 1,\dots,M$ and $i = 1,\dots,n$, maximising

$$-\sum_{m=1}^{M} \sum_{i=1}^{n} b_{im}\, \psi_{im}^2$$

s.t. $\sum_{m=1}^{M} \sum_{i=1}^{n} \psi_{im}^2 \leq n$ is equivalent to solving the equation

$$\sum_{m=1}^{M} \sum_{i=1}^{n} b_{im}\, \psi_{im} = \mu \sum_{m=1}^{M} \sum_{i=1}^{n} \psi_{im},$$

where $\mu \geq 0$.



Lemma. The Tucker3 decomposition defines the best subspace for PD-clustering.

We consider Proposition 1 with:

$$M = K,$$

$$b_{ik} = \sum_{j=1}^{J} (x_{ij} - c_{kj})^2 \qquad (18)$$

and

$$\psi_{ik} = p_{ik}, \quad \text{with } i = 1,\dots,n;\ k = 1,\dots,K.$$

Let us assume that $c_{kj}$ and $p_{ik}$ are known; replacing $(x_{ij} - c_{kj})$ with $g_{ijk}$ in (17) we obtain the following squared form:

$$\max \left( -\sum_{k=1}^{K} \sum_{i=1}^{n} \left( \sum_{j=1}^{J} g_{ijk}^2 \right) p_{ik}^2 \right) \quad \text{s.t.} \quad \sum_{k=1}^{K} \sum_{i=1}^{n} p_{ik}^2 \leq n;$$

according to Proposition 1 we obtain:

$$\sum_{k=1}^{K} \sum_{i=1}^{n} \left( \sum_{j=1}^{J} g_{ijk}^2 \right) p_{ik} = \mu \sum_{k=1}^{K} \sum_{i=1}^{n} p_{ik}. \qquad (19)$$

The value of $\mu$ that optimises (19) can be found through the singular value decomposition of the matrix $G$, which is equivalent to the following Tucker3 decomposition:

$$g_{ijk} = \sum_{r=1}^{R} \sum_{q=1}^{Q} \sum_{s=1}^{S} \lambda_{rqs} (u_{ir}\, b_{jq}\, v_{ks}) + e_{ijk},$$

with $i = 1,\dots,n$, $j = 1,\dots,J$, $k = 1,\dots,K$, where $R$ is the number of components of $U$, $Q$ the number of components of $B$ and $S$ the number of components of $V$. In matrix notation:

$$G = U \Lambda (V' \otimes B') + E. \qquad (20)$$



Proposition 1 and the Lemma demonstrate that the Tucker3 transformation of the distance matrix $G$ minimises the JDF. The following subsection presents an iterative algorithm that alternately calculates $c_{kj}$ and $p_{ik}$ on the one hand, and $b_{jq}$ on the other hand, until convergence is reached. In A.2 we empirically demonstrate that the minimisation of the quantity in formula (17) converges at least to local minima.

3.2 Factorial PD-clustering iterative algorithm 

Let us start by considering equation (13), where we apply the linear transformation $x_{ij} b_{jq}$ to $x_{ij}$ according to (16):

$$JDF = \arg\min_{C,B} \sum_{i=1}^{n} \sum_{q=1}^{Q} \sum_{k=1}^{K} (x^*_{iq} - c_{kq})^2 p_{ik}^2. \qquad (21)$$

Note that in formula (21) $x_{ij}$ and $b_{jq}$ are the general elements of the matrices $X$ and $B$, already defined in section 3.1; $c_{kq}$ is the general element of the matrix $C$ (see eq. (8)).

It is worth noting that $C$ and $B$ are unknown matrices and that $p_{ik}$ is determined once $C$ and $B$ are fixed. The problem does not admit a direct solution and an iterative two-step procedure is required. The two alternating steps are:

• Linear transformation of original data;

• PD-Clustering on transformed data.

The procedure starts with a pseudorandomly defined centre matrix $C$ of elements $c_{kj}$ with $k = 1,\dots,K$ and $j = 1,\dots,J$. Then a first solution for the probability and distance matrices is computed according to (14). Given the initial $C$ and $X$, the matrix $B$ is calculated; once $B$ is fixed the matrix $C$ is updated (and the values $p_{ik}$ are consequently updated). The last two steps are iterated as long as the JDF decreases, that is, while $JDF^{(t-1)} - JDF^{(t)} > 0$, where $t$ indicates the iteration number.

The procedure is presented below in the usual algorithmic notation (Algorithm 2). Note that the Tucker3 function is available in the MatLab N-way Toolbox [Chen, 2010].
Algorithm 2 Factorial Probabilistic Distance Clustering

function FPDC(X, K)
    JDF ← 1/eps                       ▷ JDF is initialised to the maximum
    G ← 0                             ▷ Initialise the array G, of dimension n × J × K, to 0
    p ← 1/K                           ▷ Initialise to 1/K the probability vector p of n elements
    C ← rand(K, J)                    ▷ Matrix C_kj is randomly initialised
    repeat
        for k = 1, K do
            G_k ← distance(X, C(k))   ▷ G_k distances of all units from the centre k
        end for
        B ← Tucker3(G)                ▷ Tucker3 function in MatLab Toolbox N-way [Chen, 2010]
        X* ← XB
        JDF0 ← JDF                    ▷ Current JDF is stored in JDF0
        (C, P, JDF) ← PDC(X*, K)      ▷ PDC() function is defined by Algorithm 1
    until JDF0 ≤ JDF                  ▷ Stop when the JDF no longer decreases
    return C, P
end function



4 Application on a simulated dataset 

In order to evaluate the performance of FPDC, it has been applied to a simulated dataset. The dataset has been created according to the procedure and notation of Maronna and Zamar [Maronna and Zamar, 2002].

Every cluster has been obtained by generating uncorrelated normal data $x_i \sim N(0, I)$, where $I$ is a $J \times J$ identity matrix. Each element $x_i$ has been transformed into $y_i = \Sigma x_i$, where $\Sigma^2$ is a covariance matrix with $\Sigma_{jj} = 1$ and $\Sigma_{rj} = \rho$ for $r \neq j$. For every cluster, 100 vectors $y_i$ with 7 variables have been generated. Every cluster has been centered on points that are uniformly distributed on a hypersphere. Each cluster has been contaminated at a level $\varepsilon = 20\%$; the contamination is generated according to a normal distribution centered at $r a_0 \sqrt{J}$, where $a_0$ is a unit vector generated orthogonal to $(1, 1, \dots, 1)^T$. The parameter $r$ measures the distance between the outliers and the cluster center. To prevent the outliers from overlapping the elements of the clusters, $r$ must exceed a minimum value $r_{\min}$ [Maronna and Zamar, 2002]; in this case we have chosen $r = 4$, which verifies $r > r_{\min}$. In order to evaluate the stability of the results, the method has been run 100 times and the JDF has been measured at each run; results are represented in fig. 2.

The modal percentile is obtained in 59% of cases, where the JDF lies in the interval [975, 983]. In this percentile the maximum variation in clustering structure is 1%, which corresponds to six units. In 59% of cases the error term is in the interval [0.21%, 1.5%]. The clustering structure on the first three variables is represented in fig. 3.

A well-known problem in cluster analysis is the validation of the clustering structure. There is no single index that measures clustering results, because each clustering method optimizes a different function. In order to evaluate the cluster partition, a density-based silhouette plot (dbs) can be used. According to this method, the dbs index is measured for all the observations $x_i$; all the clusters are sorted in decreasing order with respect to dbs and plotted on a bar graph (fig. 4). Usually the Euclidean distance is used to measure the distance between each cluster center and each data point; however, the Euclidean distance is not suitable for probabilistic clustering. A measure of dbs for probabilistic clustering methods is proposed in Menardi [Menardi, 2011]. An adaptation of this measure for FPDC is the following:



Figure 1: Scatter plot matrix of the simulated dataset. The dataset represents 4 normally generated clusters with a level of contamination of 20%, correlated according to the scheme in section 4. Displayed data have been standardized.

$$dbs_i = \frac{\log\left( p_{i m_k} / p_{i m_1} \right)}{\max_{i=1,\dots,n} \left| \log\left( p_{i m_k} / p_{i m_1} \right) \right|}, \qquad (22)$$

where $m_k$ is such that $x_i$ belongs to cluster $k$ and $m_1$ is such that $p_{i m_1}$ is maximum for $m \neq m_k$. The graphic shows that the clustering structure is correct.
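The index in (22) can be computed directly from the matrix of probabilities and the assigned clusters. The sketch below is an illustrative reading of the formula; the function name `density_based_silhouette` is hypothetical, and the sketch assumes that all the probabilities involved are strictly positive and that at least one log-ratio is non-zero.

```python
from math import log

def density_based_silhouette(P, labels):
    """dbs index of each unit from an n x K probability matrix and the assigned clusters."""
    ratios = []
    for p, k in zip(P, labels):
        # Largest probability among the clusters other than the assigned one.
        best_other = max(v for m, v in enumerate(p) if m != k)
        ratios.append(log(p[k] / best_other))
    # Normalisation by the largest absolute log-ratio over all units.
    scale = max(abs(r) for r in ratios)
    return [r / scale for r in ratios]
```

Units assigned to the cluster with the highest probability get a positive value, while a unit assigned to a cluster that is not its most probable one gets a negative value.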

Although no index exists that compares clustering structures, in order to point out the quality of FPDC the dataset has also been partitioned using the k-means method. The method has been run 100 times and the within variance has been measured at each run; results are represented in fig. 5.

The results have a high variability: the modal case is obtained 15% of the time and the first percentile is obtained 24% of the time. In all the resulting clustering structures there is a high percentage of error due to outliers. Results obtained in the modal case are represented in fig. 6.

Figure 2: The bar graph represents the distribution of the JDF obtained through 100 FPDC runs on the simulated dataset. The picture shows the stability of the results. The modal percentile is [975, 983] and corresponds to 59% of cases.

Figure 3: The figure shows the FPDC results on the simulated dataset composed of 4 clusters. The axes correspond to the first 3 simulated variables (see also fig. 1). Colors and symbols refer to the FPDC results. The misclassification error rate lies in [0.21%, 1.5%].

Figure 4: The figure represents the density-based silhouette plot of the clusters obtained at the modal value of the JDF over 100 FPDC runs. The graphic shows that the points have been correctly classified.

Figure 5: The bar graph represents the distribution of the within variance obtained through 100 k-means runs on the simulated dataset. The picture shows the variability of the results. The modal percentile corresponds to 24% of cases.

Figure 6: The figure shows the k-means results on the simulated dataset composed of 4 clusters. The axes correspond to the first 3 simulated variables (see also fig. 1). Colors and symbols refer to the k-means results. The method does not find the right clustering structure in the dataset.

5 Conclusion and perspectives

In this paper a new factorial two-step clustering method has been proposed: Factorial PD-clustering. This method belongs to a field of clustering techniques that has been developed in recent years: iterative clustering methods. Two-step clustering methods were proposed by the French school of data analysis in order to cope with some clustering issues. More recently, thanks to advances in computing, iterative clustering methods have been introduced. These methods iteratively perform a linear transformation of the data and a clustering, optimizing a common criterion. Factorial PD-clustering iteratively performs a linear transformation of the data and Probabilistic Distance clustering. Probabilistic Distance clustering is an iterative, distribution-free, probabilistic clustering method. When the number of variables is large and the variables are correlated, PD-clustering becomes unstable and the correlation between variables can hide the real number of clusters. A linear transformation of the original variables into a reduced number of orthogonal ones, using a criterion shared with PD-clustering, can significantly improve the algorithm performance. Factorial PD-clustering makes it possible to work with large datasets, improving the stability and the robustness of the method.

An important issue for future research is the generalization of FPDC to the case of categorical data. When dealing with large nominal and binary data matrices, the sparseness of the data and the non-linearity of the association can be more detrimental to the overall cluster stability. In this context, factorial clustering represents a suitable solution. Some methods have already been presented; it is worth mentioning the contributions of Hwang et al. [Hwang et al., 2006] and of Palumbo and Iodice D'Enza [Iodice D'Enza and Palumbo, 2010], for nominal data and binary data, respectively.



A Appendix 

This appendix contains two elements: the first one is the proof of proposition 1 and the second 
one deals with an empirical approach to the algorithm convergence. 

A.1 Proof of the Proposition 1

Proposition 1 Given an unknown matrix $B$ of generic element $b_{im}$ and a set of coefficients $0 \leq \psi_{im} \leq 1$, with $m = 1,\dots,M$ and $i = 1,\dots,n$, maximising

$$-\sum_{m=1}^{M} \sum_{i=1}^{n} b_{im}\, \psi_{im}^2$$

s.t. $\sum_{m=1}^{M} \sum_{i=1}^{n} \psi_{im}^2 \leq n$ is equivalent to solving the equation

$$\sum_{m=1}^{M} \sum_{i=1}^{n} b_{im}\, \psi_{im} = \mu \sum_{m=1}^{M} \sum_{i=1}^{n} \psi_{im},$$

where $\mu \geq 0$.

Proof (Proposition 1). To prove the proposition we introduce the Lagrangian function:

$$\mathcal{L} = -\sum_{m=1}^{M} \sum_{i=1}^{n} b_{im}\, \psi_{im}^2 + \mu \left( \sum_{m=1}^{M} \sum_{i=1}^{n} \psi_{im}^2 - n \right),$$

where $\mu$ is the Lagrange multiplier. Let us set the first derivative of $\mathcal{L}$ with respect to $\psi_{im}$ equal to 0:

$$\frac{\partial \mathcal{L}}{\partial \psi_{im}} = -2 \sum_{m=1}^{M} \sum_{i=1}^{n} b_{im}\, \psi_{im} + 2\mu \sum_{m=1}^{M} \sum_{i=1}^{n} \psi_{im} = 0,$$

which is equivalent to

$$\sum_{m=1}^{M} \sum_{i=1}^{n} b_{im}\, \psi_{im} = \mu \sum_{m=1}^{M} \sum_{i=1}^{n} \psi_{im}.$$

A.2 FPD-Clustering algorithm convergence 

In general, proving the convergence of the algorithm requires demonstrating the convexity of the objective function. With multivariate data, the analytical proof of convexity becomes a complex issue. In most multivariate situations empirical evidence is a satisfactory approach to verify the convergence of an algorithm. Moreover, the high capacity of modern CPUs makes it possible to reach the minimum, avoiding local minima, through multiple starts of the algorithm. This section aims at showing the convergence of the procedure empirically, whereas a simulation study has been conducted by [Tortora and Marino, 2011]. The proposition states that convergence to a global or to a local maximum is guaranteed. Two datasets are generated; the first is the one used in section 4. The second is a simulated $450 \times 2$ four-cluster dataset in which the variables are independent (see fig. 8). The four clusters have been generated according to four normal distributions with different numbers of elements.





Figure 7: The displays represent the behavior of the JDF at each iteration of the FPDC algorithm, obtained over 100 runs on two simulated datasets.

Figure 7 represents the following results: on the left-hand side the convergence for dataset one, on the right-hand side for dataset two. The horizontal axis represents the number of iterations, the vertical axis refers to the value of the JDF. Each broken line represents the value of the criterion at each iteration. When convergence is reached the line is straight and parallel to the horizontal axis. In both cases the procedure converges in a limited number of iterations. It is worth noting that the first iteration is not counted.
Figure 8: The figure represents the simulated 450 x 2 four-cluster dataset.